Use various classifiers (e.g. l1-penalized and ridge logistic regression, and xgboost), for a hotspot detection model.
Geographical levels State and County.
Data Between 2020-05-01 and 2020-08-25, take the
Then, form a covariate matrix by time-lagging these by 0 through 28 days. Additionally, add features by calculating "slopes of JHU IR from the past x days", where \(x \in \{3,6,9,12,15..\}\).
The resulting covariate matrix is roughly 3162 by 193 at the state level, and 52093 by 193 at the county level. We have 953 different counties in our sample.
Response data
Training and test data
We split the data into training/test set at a ratio of 70%/30%, by geographical level.
The training and test data were splitted by stratified sampling to produce a similar ratio of 0s and 1s.
The training set is used for training the hotspot model (e.g. for cross-validation of glmnet() or training xgb()), and,
The test data is used only for creating the ROC or adjusted ROC curves.
Cross-validation folds for lasso & ridge are also formed by stratified sampling, to make sure the ratio of 0s and 1s are similar in training and test data.
Models L1- and L2- penalized logistic regression, SVM, xgboost.
We first visualize the 1's and 0's in the state level data (highlighting by each CV fold and test data, in different color boxes):
## Load data and parameters
obj_state = get_data(geo_type = "state")
list2env(obj_state, envir = .GlobalEnv)
## <environment: R_GlobalEnv>
## Split training/test.
splitted <- stratified_sample_split_geo(df_model, pct_test = 0.3, seed = 100)
## Also split CV folds.
nfold = 5
foldid <- make_stratified_foldid_geo(splitted$df_train, nfold = nfold, seed = 0)
state_splits = lapply(1:nfold, function(ifold)
splitted$df_train[which(foldid==ifold),] %>% select(geo_value) %>% unlist() %>% unique())
## Setup the plot.
par(mfrow = c(13, 4)); par(mar = c(3, 3, 1, 1)); par(cex=.7)
## Collect the geo values, in a particular order
train_geos = lapply(1:nfold, function(ifold){
splitted$df_train[which(foldid==ifold),] %>% select(geo_value) %>% unique()}) %>%
unlist()
test_geos = splitted$df_test %>% select(geo_value) %>% unlist() %>% unique()
geos = c(train_geos, test_geos)
## Make the individual plots
for(geo in geos){
## Form the response data (0's and 1's) for all geos
dat0 = df_model %>% subset(geo_value==geo) %>%
select(time_value,
resp,
incidence = "feature_lag0_confirmed_7dav_incidence_prop_indicator-combination" )
## Combine it with original data for this geo (all time points)
dat = mat %>% as_tibble() %>% filter(geo_value==geo, signal =="confirmed_7dav_incidence_prop") %>%
select(time_value, incidence = value)
dat_combined = dat %>% full_join(dat0, by = c("time_value", "incidence"))
dat_combined = dat_combined %>% mutate(resp=replace(resp, is.na(resp), 0))
## Make the plot
dat_combined %>% with(plot(time_value, incidence, col = resp+1, type='o', pch=16, ylim= c(0,100)))
abline(h = 20, col='grey50', lwd=3)
legend("topleft", legend=toupper(geo), bty='n', cex=3)
## Also add legend for fold id (really bad code)
splitnum = sapply(state_splits, function(mysplit) geo %in% mysplit) %>% which()
if(length(splitnum) == 0){
splitnum = "TEST"
} else {
box(lty=1, col=splitnum + 1, lwd=3)
}
legend("topright", legend = paste0("\nfold ", splitnum),
bty="n", cex=2)
}
We can see that the 0's and 1's are evenly distributed, with somewhat imbalanced class labels -- about 14% of data being 1's and 86% being 0's.
all_tbl = df_model %>% select(resp) %>% table() ##%>% knitr::kable() %>% print()
train_tbl = splitted$df_train %>% select(resp) %>% table() ##%>% knitr::kable() %>% print()
test_tbl = splitted$df_test %>% select(resp) %>% table() ##%>% knitr::kable() %>% print()
all_folds_tbl = lapply(1:nfold, function(ifold){
splitted$df_train[which(foldid==ifold),] %>% select(resp) %>% table() ##%>% knitr::kable() %>% print()
})
cv_folds_tbl = do.call(cbind, all_folds_tbl)
colnames(cv_folds_tbl) = paste0("fold", 1:nfold)
tab = cbind(all = all_tbl,
train = train_tbl,
train = test_tbl,
cv_folds_tbl)
tab = rbind(tab, "Proportion of 1's"= apply(tab, 2, function(a)a[2]/(a[1]+a[2])))
tab %>% knitr::kable(digits=2) %>% print()
| all | train | train | fold1 | fold2 | fold3 | fold4 | fold5 | |
|---|---|---|---|---|---|---|---|---|
| 0 | 2998.00 | 2119.00 | 879.00 | 336.00 | 455.00 | 476.00 | 423.00 | 429.0 |
| 1 | 472.00 | 331.00 | 141.00 | 74.00 | 89.00 | 68.00 | 53.00 | 47.0 |
| Proportion of 1's | 0.14 | 0.14 | 0.14 | 0.18 | 0.16 | 0.12 | 0.11 | 0.1 |
First, download and form data:
obj_state = get_data(geo_type = "state")
obj_county = get_data(geo_type = "county")
Save the results:
save(obj_state, file=file.path(outputdir, "state-hotspot-data.Rmd"))
save(obj_county, file=file.path(outputdir, "county-hotspot-data.Rmd"))
Load them:
load(file=file.path(outputdir, "state-hotspot-data.Rmd"))
load(file=file.path(outputdir, "county-hotspot-data.Rmd"))
Now, we fit the hot-spot model at and visualize the model performances.
The performances of three different classifiers can be visualized as receiver operating characteristic (ROC) curves. Here, thick transparent lines are the results using Facebook data, and thin solid lines are those excluding Facebook data (only using JHU data).
Two plots (the first is for state level data, and second is for county level data ) are repeated five times each different random seeds for the training/test split and CV split.
for(iseed in 0:4){
print(paste0("Random seed" = iseed))
for(geo_type in c("state", "county")){
if(geo_type == "state") list2env(obj_state, envir = globalenv())
if(geo_type == "county") list2env(obj_county, envir = globalenv())
splitted = stratified_sample_split_geo(df_model, pct_test = 0.3, seed = 100+iseed)
response = "confirmed_7dav_incidence_prop"
plots = make_plots(destin = outputdir, splitted, lags, n_ahead, geo_type,
response, fn_response_name, threshold, slope, onset,
geo_cv_split_seed = 0+iseed)
print(plots$roc)
}
}
## [1] "0"
## [1] "1"
## [1] "2"
## [1] "3"
## [1] "4"
Next up: Choosing a summary statistic to measure the advantage in performance using Facebook data, we'll calculate difference in the area under the curve (AUC) between the two versions, for each classifier, as a function of how far ahead hotspots are defined as i.e. for a range of n_ahead, in \(\{10, \cdots, 30\}\).
We perform some simple diagnostics at the state level.
The predicted probabilities are concentrated near 0 and 1.
list2env(obj_state, envir = globalenv())
## <environment: R_GlobalEnv>
splitted = stratified_sample_split_geo(df_model, pct_test = 0.3, seed = 100)
preds = make_preds(destin = outputdir, splitted, lags, n_ahead, geo_type,
response, fn_response_name, threshold, slope, onset,
geo_cv_split_seed = 0,
include_fb = TRUE)
par(mfrow=c(1,3))
hist(preds %>% select(contains("lasso"))%>% unlist(), col='grey80', xlab = "", main=paste0("lasso predicted probabilities"))
hist(preds %>% select(contains("ridge"))%>% unlist(), col='grey80', xlab = "", main=paste0("ridge predicted probabilities"))
hist(preds %>% select(contains("xgb"))%>% unlist(), col='grey80', xlab = "", main=paste0("xgb predicted probabilities"))
The predicted probabilities by test label:
par(mfrow=c(1,3))
testmat = cbind(preds=preds, resp= splitted$df_test$resp) %>% as_tibble()
for(model in c("lasso", "ridge", "xgb")){
res = list("0" = testmat %>% subset(resp==0) %>% select(contains(model)) %>% unlist(),
"1" = testmat %>% subset(resp==1) %>% select(contains(model)) %>% unlist())
boxplot(res, xlab = "Test Labels\n(Hotspots are 1's)",
ylab = "Predicted prob.",
main = paste0("Test set performance (", model, ")"))
}
Let's see predicted probabilities (thick green lines) in test sets overlaid with the JHU case counts (red points are hot spots).
We'll repeat this five times for analyses with different random seeds for training/test data split and CV fold splits, to try to cover more states.
models = c("lasso", "ridge", "xgb")
predcols = RColorBrewer::brewer.pal(3, "Set2")
names(predcols) = models
for(iseed in 0:4){
splitted = stratified_sample_split_geo(df_model, pct_test = 0.3, seed = 100+iseed)
preds = make_preds(destin = outputdir, splitted, lags, n_ahead, geo_type,
response, fn_response_name, threshold, slope, onset,
geo_cv_split_seed = iseed,
include_fb = TRUE)
res = full_join(splitted$df_test %>% select(geo_value, time_value, resp, val=contains("lag0_confirmed")),
preds) %>% as_tibble()
splitted$df_test %>% select(time_value) %>% summary()
preds %>% select(time_value) %>% summary()
par(mfrow=c(8, 2))
par(mar=c(3,4,1,1))
par(oma=c(0,0,10,0))
one_geo_diagnose <- function(df,...){
plot(y=df$val, x=df$time_value, type='o',
col=df$resp+1, pch=16,ylim=c(0,70),
xlab = "Time",
ylab="JHU Household \n Case Proportions")
abline(h=seq(from=0,to=200, by=5), col='grey80', lty=2)
thisgeo = df$geo_value %>% unlist() %>% unique()%>%toupper()
legend("topleft", legend = thisgeo, bty="n", cex=2)
for(model in models){
lines(x=df$time_value, y=(df%>%select(contains(model))%>%unlist())*10, col=predcols[model], lwd=2)
}
## if(thisgeo=="AK"){
## }
return()
}
res %>%
group_by(geo_value) %>%
group_map(one_geo_diagnose, keep=TRUE) -> dummy_obj
mtext(paste0("Random seed=", iseed),
side = 3, outer=TRUE)
legend("topright", lwd=2, col=predcols,
legend=paste0("Predicted prob (x10) of ", models), bg="white")
}